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Abstract 

Background: RNA-Seq technology measures the transcript abundance by generating sequence reads and counting 
their frequencies across different biological conditions. To identify differentially expressed genes between two 
conditions, it is important to consider the experimental design as well as the distributional property of the data. In 
many RNA-Seq studies, the expression data are obtained as multiple pairs, e.g., pre- vs. post-treatment samples from 
the same individual. We seek to incorporate paired structure into analysis. 

Results: We present a Bayesian hierarchical mixture model for RNA-Seq data to separately account for the variability 
within and between individuals from a paired data structure. The method assumes a Poisson distribution for the data 
mixed with a gamma distribution to account variability between pairs. The effect of differential expression is modeled 
by two-component mixture model. The performance of this approach is examined by simulated and real data. 

Conclusions: In this setting, our proposed model provides higher sensitivity than existing methods to detect 
differential expression. Application to real RNA-Seq data demonstrates the usefulness of this method for detecting 
expression alteration for genes with low average expression levels or shorter transcript length. 



Background 

Gene expression profiles are routinely collected to identify 
differentially expressed genes and pathways across various 
individuals and cellular states. Sequencing-based tech- 
nologies offer more accurate quantification of expression 
levels compared to other technologies. Early sequence- 
based expression measured transcript abundance by 
counting short segments, known as tags, generated from 
the 3' end of a transcript. Tag-based methods include 
the Serial Analysis of Gene Expression (SAGE, [1]), Cap 
Analysis of Gene Expression (CAGE), LongSAGE, and 
massively parallel signature sequencing (MPSS). The 
development of deep sequencing technology enables 
simultaneous sequencing of millions of molecules and has 
led to advanced approaches for expression measurement 
[2,3]. Digital gene expression - tag profiling [4] adapted 
the tag-based approach for use with the next-generation 
sequencing platform. RNA-Seq is an alternative approach, 
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that is an application of whole genome shotgun sequenc- 
ing! Briefly, it entails generating a cDNA library by ran- 
dom priming off of fragmented RNA. The cDNA library 
is then subject to next-generation sequencing to generate 
short nucleotide sequences (reads) that correspond to the 
ends of the cDNA fragments. RNA-Seq aims to measure 
the entire transcriptome and is preferable to microarrays 
and tag-based approaches since it provides more infor- 
mation such as alternative splicing and isoform-specific 
gene expression with very low background signal and 
a wider dynamic range of quantification [5]. Moreover, 
recent experiments revealed that the RNA-Seq measures 
expression level with high accuracy and reproducibility 
[6-9], 

Sequence-based approaches quantify gene expression 
as a 'digital' count and require modeling suitable for 
a count random variable. The Poisson distribution has 
been central in modelling expression data [10-12] and 
commonly applied to RNA-Seq data [6,13]. In partic- 
ular, Li et al. (2012) proposed a permutation-based 
approach to generate the null distribution [14]. However, 
Poisson-based approaches may not take all the varia- 
tions between biological samples into account. The Beta- 
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Binomial hierarchical model [15,16], overdispersed logis- 
tic [17], and overdispersed log-linear models [18] were 
proposed to capture extra variance for each gene sepa- 
rately. Negative Bionomial models have been proposed 
to estimate the overdispersion parameter by shrinkage 
estimation [19-21], mean-dependent local regression [22], 
or empirically derived prior distribution [23]. Alter- 
natively, beta-binomial [24] and Poisson mixture [25] 
models were proposed under the Bayesian modeling 
framework. Nonparametric method with resampling was 
also considered [26]. These approaches generally assume 
that samples under two groups are obtained indepen- 
dently. Recently, some of these approaches have been 
extended to deal with multi-factor design structures 
[14,16,21,22]. 

Many practical RNA sequencing studies collect data 
with a paired structure, where the global expression pro- 
files are measured before and after a treatment is applied 
to the same individual. Appropriate modeling of such data 
requires taking this design structure as well as the distri- 
butional property of the data into account. The Poisson 
model has been used to test the effect of drugs when 
the observation occurs as paired data, such as predrug 
and postdrug counts [27]. Lee [28] considered a mix- 
ture model to account for extra variance among indi- 
viduals over the level that would be expected under the 
Poisson model. These approaches assume independence 
of the paired observation conditional on the individual 
mean. Bivariate Poisson or negative binomial distribu- 
tion are alternative choices to model correlations between 
observations [29,30]. 

In this paper, we propose a Bayesian hierarchical 
approach to modeling paired count data that separately 
accounts for the within and between individual vari- 
ability from a paired data structure. Our work adopts 
the Poisson-Gamma mixture model [28] and utilizes a 
Bayesian approach to evaluate the expression difference. 
We note that the Bayesian models are widely utilized in 
microarray studies and have improved sensitivity to detect 
differential gene expression by sharing information among 
genes [31]. Mixture models are also commonly used 
to model differential expression, where non-differentially 
expressed and differently expressed genes correspond to 
different mixture components. Various mixture model 
specifications have been considered in the literature. 
The gamma and log-normal distribution were used to 
model the expression levels [32,33]. Smyth [34] assumed 
a point mass at zero for log scaled fold change for 
null genes and a normal distribution centered at zero 
for non-null genes. Lonnstedt et al. [35] and Gottardo 
et al. [36] proposed a mixture of two (null and non- 
null) or three normal (null, over, and under expres- 
sion) distributions. Non-parametric approaches have 
also been utilized [31,37]. Lewin et al. [38] discussed 



various choices of mixture component priors and model 
checking. 

The rest of this manuscript is organized as follows. Data 
Section introduces the biological problem and data that 
motivated this study. Methods Section presents our para- 
metric model and the Bayesian method to identify genes 
with differential expression levels. The performance of the 
proposed model is examined by Simulations. Two sets 
of simulation studies are conducted: (1) those based on 
the model assumption to investigate the accuracy of the 
proposed method on parameter estimation, and (2) those 
based on mimicking the motivating data set to exam- 
ine the robustness of the proposed method. Finally, the 
proposed method is applied to real data with detailed 
discussion of the results and comparisons with other 
methods. 

Data 

Qian et al. (Qian F. et al: Identification of genes critical 
for resistance to infection by West Nile virus using RNA- 
Seq analysis, submitted) designed an RNA-Seq experi- 
ment to study human West Nile virus (WNV) infec- 
tion. One objective of this study was to identify altered 
genes/transcripts from viral infection of primary human 
macrophages in comparison to uninfected samples. This 
study naturally has a paired design structure. A total 
of 10 healthy donors were recruited according to the 
guidelines of the human research protection program of 
Yale University and cells were isolated from fresh hep- 
arinized blood samples for infection with WNV (strain CT 
2741, MOI=l, for 24 hours) as described previously [39]. 
PolyA+ RNA was prepared from uninfected and WNV- 
infected primary macrophages, fragmented, and sub- 
jected to sequencing using the Illumina Genome Analyzer 
2. Approximately 50 million quality filtered reads were 
obtained from each sample, and about 85% were mapped 
to the human transcriptome (hgl9) with ENSEMBL tran- 
script annotations (Release 57) using TOPHAT v. 1.1.4 
[40]. Genes and transcript isoforms were scored for 
expression by a maximum likelihood based method imple- 
mented in Cufflinks v.0.9.3 [41]. To analyze differential 
expression, the data were first converted from the FPKM 
unit (fragments per kilobase of exon per million fragments 
mapped) to the number of reads originated from each 
transcript isoform. The trimmed-mean method [42] was 
applied to further normalize the count expression values. 
The processed data contains transcript-level expression 
counts from a total of 20 samples consisting of 10 pairs 
of uninfected and virus infected samples. For differen- 
tial expression analysis, we removed transcripts with less 
than 10 total counts across 10 uninfected samples or no 
observed count from 6 or more individuals in the unin- 
fected conditions. After these steps, 37,111 transcripts 
were considered for data analysis. 
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Methods 

Bayesian mixture model for paired counts 

We now describe our Bayesian hierarchical mixture model 
to identify differentially expressed genes/transcripts from 
paired RNA-seq data. As noted above, such data arise nat- 
urally from experiments measuring the biological change 
from treatments. We start with an overdispersed count 
model [28]. The observations are denoted by a pair 
(Ygn, Ygi2), for gene g = 1, . . . , G and individual i = 
l,...,n, where Y g n is the observed baseline expression 
level and Y g i2 is the observed level after treatment. The 
sizes of the libraries are denoted as Na and Nq, respec- 
tively. Let kgi denote the true baseline expression relative 
to the library size. Then, Y g n can be modeled as a Pois- 
son random variable with mean kgiNn. Let Xg denote the 
expression level fold change after treatment so the true 
expression level is Xg^gi^i2> then Y g i2 can be modeled as a 
Poisson random variable with mean Xg^gi^i2- Our goal is 
to test whether there is any treatment effect, i.e., Xg 7^ h 
where 

Y g a\kgi,Xg ~ Poisson(^i^), 

Y g i2\^gh Xg ~ Poisson(AfeA^;fe). (1) 

It has been shown that the variability among techni- 
cal replicates for RNA-Seq data can be captured by the 
Poisson distribution [6]. However, greater variance can 
be expected, if observations are collected from individu- 
als with differences in the underlying biological system. 
One way to model the overdispersion among the Pois- 
son counts is to mix it with a Gamma distribution [28]. 
In this model, we use a Gamma distribution to model the 
baseline expected expression, kg, across individuals with 
shape parameter a g and rate f} g ; 

1 \0C g ) 

This model allows us to obtain a simpler form of the 
predictive density, i.e., the kg's can be integrated out 
(see Appendix). 

Assuming independence between the baseline expres- 
sion and treatment effect, we use a two-component mix- 
ture model to characterize the fold change distribution, 
where the expression change state of each gene is denned 
by a latent variable z g , with z g = 0 corresponding to 
no change and z g = 1 otherwise. We assume that z g 
has a probability of tzq for equal expression, i.e., z g = 0, 
and a probability of tt\ = 1 — no for differential expres- 
sion. Given a state, 0 or 1, the log-scaled fold change 
is assumed to follow a normal distribution. Under equal 
expression, the log-fold change is assumed to arise from a 
normal distribution centered at zero and variance <7q . For 
genes with differential expressions, if we assume their log- 
fold changes follow a normal distribution centered around 



zero, we implicitly assume that there is equal chance for 
a gene to be either over or under expressed. However, 
more genes were under-expressed after the viral infection 
for the data set described earlier, with 3.2% of transcripts 
showing increased expression by more than 4 fold after 
the infection whereas 4.3% showing reduction by more 
than 4 folds. To accommodate this asymmetry, we assume 
the log-fold change for non-null genes arises from a nor- 
mal distribution with mean /xi, which may be different 
from 0, and variance o\. 

\o%{Xg)\{z g = 0) ~ Normal(0, a 0 2 ) 
log(x^) I (% = 1) ~ NormalOi, of) 

Collecting all the components discussed so far, the 
model can be summarized in Figure 1. Under this set-up, 
the goal is to estimate the posterior probability that a spe- 
cific gene is differentially expressed after treatment, i.e., 
Pr(z g = l|data). Genes can then be inferred to DE (Differ- 
ential Expression) or EE (Equal Expression) according to 
these probabilities. 

To complete our model description, we need to spec- 
ify prior assumptions for the unknown model parameters, 
0 = ({a g }, {fi g }, 7i q, 7i i, o-q , /jii, of). In our implementation, 
we assume non-informative priors for these unknown 
parameters: 

1. (7To,7Ti) ~ Dirichlet(l,l), i.e., ttq ~ Uniform(0, 1). 

2. Each a g and /3 g has a non-informative prior. 

3. p((Tq) oc 1/<Tq and^o-j 2 ) oc l/of. 

4. fii has an improper prior. 

5. Joint independency among all the parameters. 

Parameter estimation via Markov chain Monte-Carlo 
(MCMC) 

In this section, we describe the Gibbs sampling algorithm 
[43] that we use to iteratively sample model parame- 
ters from their conditional distributions given the other 
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Figure 1 Diagram illustrating the hierarchical model for paired 
RNA-Seq data. 
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parameters and the observed data. First, we evaluate the 
conditional distribution of parameters (a g , /3 g ) charac- 
terizing the baseline expression distribution (X g i). These 
parameters are separately updated using the Metropolis- 
Hastings algorithm. For the latent state z g and expression 
level change Xg> the state % is nrst proposed and then 
X g is sampled given the state. Lewin et al. [38] discussed 
this type of move with various choices of the mixture dis- 
tribution. Details of our updates on the pair of (Xg>Zg) 
are described in the Appendix. Mixing proportions (7To, 
7i i) and hyper-parameters for the mixture distribution 
(<7q , crj 2 , /jL\) are sampled from their conditional posterior 
distributions which can be derived in closed forms. 

DE classification and false discovery rate estimation 

The MCMC algorithm generates random samples from 
the joint posterior distribution of all model parameters. 
These samples are then used to infer the status of differ- 
ential expression. One way to select a set of interesting 
genes is to rank genes using estimated posterior-mean fold 
change 



f^exp|i^log( X f )J, 



(3) 



where T is the number of iterations used for inference 
after the burn-in period and Xg^ is the sampled value 
for the fold change on iteration t of the Gibbs sampling 
algorithm. Another way to select DE genes is to consider 
the latent variable, z g . During the MCMC iteration, the 
expression state is sampled along with the fold change 
estimates. These MCMC samples can be used to approxi- 
mate the posterior probability of differential expression by 
counting the proportion of sampled states being differen- 
tially expressed: 



1 T 

p g = P(z g = \\datd) { z f = X ) ' 

t=i 

The Bayes' rule assigns a genes expression status 
according to the largest posterior probability. An alter- 
native is to classify a gene if the posterior probability of 
being non-null is greater than a threshold (pthres) : Pg > 
Pthres- F° r example, one choice would be pthres = 0.5. The 
false discovery rate can be estimated from these posterior 
probabilities [31]: 



FDR ■ 



ti(p g > Pthres) g:pg>mres 



(4) 



Results and discussion 

Simulations 

Simulations based on the model assumptions 

The first part of the simulation was conducted to examine 
the performance of the proposed approach when the data 
are generated under the model assumptions. For 10,000 
genes and 10 individuals, we simulate expression counts 
both before and after treatment according to Equation 1. 
Library sizes are sampled uniformly from 7 to 18 millions 
and relative expected baseline expression X g i are drawn 
from a Gamma distribution with shape 0.1 and rate 1,000. 
For simplicity, we consider a two-component log-normal 
mixture model for effect size. For the null genes (90%), 
the log-scaled effect is sampled from a normal distribu- 
tion with a mean 0 and a standard deviation (ao) 0.1, 
whereas the log-effects are sampled from a normal dis- 
tribution with mean (/xi) of 1.5 and standard deviation 
(ai) of 0.5 for the non-null genes. For the simulation 
studies, the true library sizes are used for the parameter 
estimation. 

Results in Table 1 show that the proposed approach 
estimates the model paramters well. With a posterior 
probability cutoff of 0.5, the algorithm identified more 
than 97% of true DE genes with an FDR of approximately 
1%. Figure 2 illustrates the estimated fold changes showing 
the good performance of our algorithm. 

Simulations based on the empirical data 

In the second part of the simulation, we assume that the 
expression abundance is measured for 5,000 genes simul- 
taneously before and after a given treatment. The number 
of individuals is set to be 10 for the relatively larger sam- 
ple case (cases 1 and 4), 5 for the medium (cases 2 and 
5), and 3 for the relatively smaller sample case (cases 3 
and 6). The size of each library is randomly sampled from 
1.8 to 3 million to have simulated count distribution com- 
patible with the real data distribution. The infected set of 
the RNA-seq data (Data Section, Qian F. et al. for details) 
was used as the expected baseline count data to mimic 
the observed mean-specific dispersion. First, we sample 
5,000 gene indices with replacement to get the expected 
baseline expression. Expression counts from the selected 
indices are summarized by a matrix where rows from 
this data matrix correspond to the selected genes in the 



Table 1 Posterior means of the parameters in the model 



The method was implemented in R and is available at 
http://bioinformatics.med.yale.edu. 



Parameters 


True parameter 


Posterior mean 




0.01 


0.013(0.002) 




1.5 


1.501 (0.015) 




0.25 


0.238 (0.015) 


7T1 


0.1 


0.099 (0.001) 
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original data matrix and columns correspond to individu- 
als. Then, the relative expression {^ g i,i=i,...,N> Equation 1) 
is computed proportional to the total counts in each 
sample. 

Among 5,000 genes, the first 4,000 are assumed to have 
no change (z g = 0) and their log-fold changes, log(xg), 
are sampled from a normal distribution with a mean of 0 
and a variance of 4 x 10 -4 . For the rest of non-null genes, 
we considered the following two scenarios. An empirical 
set-up (cases 1, 2, 3) utilizes nominal fold change from the 
uninfected data set. Cases 4, 5, and 6 consider a theoretical 
setup, where the log-scaled fold change is drawn from a 
normal distribution with a mean of zero and a variance 
of 1. We further filter out non-null genes whose true fold 
changes are less than 1.4. 

Each case was repeated 100 times. We compare the 
performance of our approach with DESeq (version 1.8.3) 
[22] and edgeR (version 2.6.10) [21], two widely used 
methods for RNA-seq data for the purpose of identifying 
differentially expressed genes. These two methods assume 
a negative binomial distribution to explain the variance 
due to the replicate. DESeq utilizes a smoothing curve 
to compute the overdispersion as a function of the aver- 
age expression level. An option pooled-CR' is used to 
estimate the overdispersion parameter [44]. In edgeR, a 
common dispersion setting is used which assumes a con- 
sistent overdispersion across all the features and estimates 
the parameter using a common likelihood function. A 
paired design can be incorporated by utilizing generalized 
linear model. For each application, the true library sizes 
are used as the library size inputs. 

Table 2 summarizes the results of our approach. Overall, 
we see excellent performance of our method in inferring 
the expression change status (reflected in a high corre- 
lation with the true status) as well as the parameters 
characterizing the distributions for the null and non-null 
genes. Since true expression states are known in the sim- 
ulation, we call a feature to be differentially expressed if 
p g > Pthres an d compare the estimated false discovery 
rate with the true value (Figure 3). The FDR is estimated 



well for cases with large sample sizes as pthres increases, 
while it is slightly under-estimated for small sample sizes. 
Figure 4 illustrates the receiver operating characteristics 
averaged across 100 simulations under four different sim- 
ulation settings. For each setting, the true positive rate 
is plotted against the false positive rate. The correspond- 
ing rates are computed by ranking genes from the largest 
posterior probability by the Bayesian approach (then, the 
largest fold change, if tied) or from the smallest p-value by 
each of the other methods. The Bayesian approach shows 
higher sensitivity at the same level of false positive rates 
than the edgeR and DESeq. Especially, the Bayesian model 
achieves better performance for smaller sample size and 
empirical fold change setting (case 2 or 3). 

We further considered a simulation scenario simi- 
lar with the real data. As shown in the data appli- 
cation, the log-scaled fold change estimated from the 
data has larger variance under null component. We set 
the null component variance to be 0.35 and repeated 
the simulation 50 times. For features in the non-null 
group, log-fold change was sampled from a normal dis- 
tribution with a mean of -0.45 and a variance of 4. 
Simulation was performed with the sample size of 10 
(case 7) and the size of 5 (case 8). Averages of the 
parameter estimates (/xi, <7q , aj 2 , 7i\) for cases 7 and 8 
are (-0.42,0.35,3.92,0.20) and (-0.42,0.35,3.85,0.21), 
respectively. Similarly with the cases 1 through 6, the esti- 
mated false discovery rate is examined (Figure 3) and 
performance of the proposed approach is compared with 
two existing methods (Figure 4). 

Applications 

Differential expression analysis with the Bayesian modeling 

In this section, we apply our method to the motivating 
data set described in the Data Section. Initial values of 
the model parameters are calculated directly from the 
data. The MCMC sampling is run 4,000 iterations after 
discarding the first 8,000 iterations. On average, computa- 
tional time was around 5 minutes per every 100 iterations. 
The number of total iterations and burn-in period are 
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Table 2 Estimated posterior means and results for 
empirical simulation 





Casel 


Case 2 


Case 3 


w 


10 


5 


3 




-0.170 (0.037) 


-0.169 (0.041) 


-0.157(0.041) 




3.653 x 10- 4 


3.604 x 10- 4 


3.83 x 10- 4 




(3 x 10- 5 ) 


(4.421 x 1CT 5 ) 


(6.090 x 10- 5 ) 




0.984 (0.104) 


0.968 (0.115) 


0.955 (0.110) 


71] 


0.151 (0.004) 


0.153 (0.005) 


0.156(0.006) 


COriXgSgT 


0.972 (0.006) 


0.993 (0.003) 


0.953 (0.011) 


FDR 


0.030 (0.008) 


0.046 (0.011) 


0.068 (0.013) 


FDR 


0.024 (0.004) 


0.037 (0.005) 


0.049 (0.006) 


Sensitivity 


0.928 (0.014) 


0.866 (0.020) 


0.802 (0.025) 


Specificity 


0.995 (0.001) 


0.994 (0.002) 


0.991 (0.002) 




Case 4 


Case 5 


Case 6 


N 


10 


5 


3 


Mi 


0.007 (0.035) 


0.006 (0.038) 


-0.002 (0.037) 




3.634 x 10- 4 


3.532 x 10- 4 


3.450 x 10- 4 




(2.931 x 10- 5 ) 


(4.155 x 10- 5 ) 


(5.283 x 10- 5 ) 


°t 


1.172 (0.048) 


1.151 (0.059) 


1.140 (0.050) 


71] 


0.179 (0.003) 


0.183 (0.004) 


0.188 (0.005) 


cor(x g ,Xg)* 


0.990 (0.002) 


0.979 (0.004) 


0.965 (0.007) 


FDR 


0.030 (0.008) 


0.044 (0.009) 


0.064 (0.012) 


FDR 


0.021 (0.004) 


0.031 (0.005) 


0.042 (0.006) 


Sensitivity 


0.953 (0.011) 


0.906 (0.015) 


0.862 (0.020) 


Specificity 


0.995 (0.001) 


0.992 (0.002) 


0.989 (0.002) 



Operating characteristics are based on the Bayes rule. cor(xg, Xg)* is the 
correlation coefficient between the true difference and the estimated difference. 



determined by monitoring trace plots of MCMC sam- 
ples (Figure 5 (a)). We estimate the mixing proportion to 
be 0.88 and 0.12 for EE and DE group, respectively. The 
posterior means for the parameters \i\ and o\ are -0.45 
and 4.04, respectively. The null group has a variance of 



0.35. Under the Bayes rule (pthres = 0-5), 2,352 transcripts 
are classified into DE after the West Nile virus infection. 
The estimated FDR is 16.2% from Equation 4. Figure 5 
(b) illustrates the fold change distributions under DE and 
EE based on the Bayes rule classification. The estimated 
fold changes are plotted in Figure 6 (a) against their DE 
posterior probabilities. 

Comparisons with existing methods 

In this section, we compare DE analysis results between 
our approach and existing methods. The DESeq or edgeR 
is applied to the same data set and top 2,352 DE tran- 
scripts are selected by their p-values. The edgeR shows 
higher consistency with our Bayesian model with 63.5% 
of overlap than the DESeq having 34.3% of overlapping 
transcripts. Specifically, 832, 632, and 1,364 transcripts 
are detected uniquely by the Bayes, edgeR, and DESeq, 
respectively (Figure 6). Our approach detects those having 
low average expression and high fold change. In contrast, 
other approaches tend to identify more transcripts with 
high expression level and low fold change (Figure 7). Tran- 
scripts which have evidence of differential expression only 
by the proposed model often have large inter-individual 
variation. Their fold changes are high after the treatment 
except a few low expressed individuals. Figure 8 illustrates 
an example of uniquely identified transcript by our pro- 
posed approach. This transcript is a product of SLAMF7, 
which is known to play a role in natural killer cell acti- 
vation [45]. Another interesting feature of the proposed 
method is that the proportion of DE genes is consis- 
tent across transcript length. Among the bottom 10% of 
the short transcripts, 4.6% are detected by the proposed 
approach while 2.4% are found by other methods. Among 
the top 10% of the long transcripts, 6.5% are detected by 
the proposed method whereas 7.4 and 8.9% are detected 
by DESeq and edgeR, respectively. To investigate more 
details, Figure 9 illustrates the DE proportion when the 
transcripts partitioned into 10 equal-sized bins based on 
their length. 



cases 1 ,2,3 cases 4,5,6 cases 7,8 




0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 



p_thres p_thres p_thres 

Figure 3 False discovery rate from the simulation. True and estimated false discovery rates are compared across different threshold for posterior 
probability. Solid lines are true values and dashed lines are estimated values averaged over all simulations. Left panel shows the result from 
simulation cases 1 , 2, and 3, where non-null fold change is empirically generated. Results for cases 4, 5, 6 and 7,8 are illustrated on the middle panel 
and right panel, respectively. 
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Figure 4 Simulation results. Operating characteristics for 8 simulation settings are plotted with red, green, and blue lines for the Bayes, DESeq, and 
edgeR methods, respectively. 
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Figure 5 Trace of parameters regarding the mixture distrubution. Trace of parameters regarding the mixture distrubution (a) and distributions 
of fold change estimates for genes classified into EE and DE groups, respectively, by the Bayes' rule (b). 



Bioinformatics annotations of the results 

Pathway-level analysis is one effective way to summarize 
biological relevance of differentially expressed genes. We 
perform gene enrichment analysis using DAVID (http:// 
david.abcc.ncifcrf.gov/). 2,352 DE transcripts inferred 



from our approach are mapped to 1,518 DAVID IDs 
for functional annotation clustering. Cluster 1 (DAVID 
enrichment score: 11.39) represents cellular response 
to the WNV infection. Specifically, pathways in clus- 
ter 10 (score: 2.72) are related to the activation of the 




-5 0 5 

Estimated log fold change 

Figure 6 Result of the Bayesian approach and comparison with other existing methods. Posterior probabilities against estimated fold change 
(a) and consistency between the Bayesian approach and existing approaches when the same number of top-ranked transcripts are chosen (b). 
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Figure 7 Comparison of DE transcripts. Commonly detected transcripts by all three methods are labeled in purple: log-scaled Bayesian estimated 
fold change against log-scaled average expression. Other three panels show DE transcripts detected by each of three methods. They are labeled in 
red, green, and blue for the Bayes, DESeq, and edgeR methods, respectively. 
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Figure 8 Example of uniquely selected by the proposed Bayesian model. Illustration of expresseion values from a transcript detected by the 
proposed method only. 
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Figure 9 DE proportion and transcript length. Proportion of DE 
transcripts over their average expression level. Transcripts are 
partitioned into 1 0 equal-sized bins by their expression levels. The 
proportion of transcripts inferred to be DE is plotted on the y-axis. 
Red, green, and blue lines are from Bayes, DESeq, and edgeR 
methods, respectively. 



transcripts with large treatment effects but low expres- 
sion levels, whereas these transcripts were not inferred 
to be differentially expressed by other approaches. This 
is likely due to the more flexible and adaptable model- 
ing of variance across individuals in our approach. Further 
examination of the characteristics of these top-ranked 
transcripts shows that the proportion of top-ranked tran- 
scripts in the short transcript group is consistent with 
the proportion in the long transcript group. On the other 
hand, the gene sets detected by the existing approaches 
show a bias towards longer transcripts, as has been noted 
in the literature before [48,49]. Our model reduces this 
bias and as a result facilitates detection of some short- 
length differentially expressed transcripts that the other 
approaches miss. 

We have assumed that the log-fold change arises from a 
mixture of two normal distributions. Under DE, the model 
allows the mean of log-fold change distribution not to be 
restricted at zero. By doing so, our proposed model can be 
applied to the data showing asymmetry between over and 
under expression. A normal distributional assumption is 



macrophage after the virus infection. Molecular functions 
of these transcripts are characterized by their cytokine 
production (GO:0001817) in cluster 3. Cluster 8 (score: 
2.89) consists of transcripts that are involved in apop- 
tosis (GO:0042981) and the regulation of programmed 
cell death (GO:0043067). The induction of apoptosis by 
WNV is essential in the regulation of pro-inflammatory 
responses, and has been previously reported in cell lines 
and neuronal cell types [46,47]. These clusters and related 
top pathways are reported on Table 3 with enrichment 
scores and p-values. 

Conclusions 

In this paper, we have presented a hierarchical mixture 
model for the identification of differential gene expres- 
sion from RNA-Seq data motivated by a West Nile Virus 
study, which collected samples as multiple pairs, i.e. 
pre- vs. post-treatment for each individual. While such 
design is common in biological investigations, few existing 
methods analyze such data appropriately. With a hier- 
archical Bayesian mixture model coupled with inference 
through MCMC, our approach incorporates variability 
across genes, individuals, and treatment effects in the con- 
text of a paired experiment. Application to both simulated 
and real data demonstrates that our model and imple- 
mentation is suitable for paired design, having distinct 
advantages compared to the existing methods. 

Simulation study suggests that our Bayesian setting can 
have better power to detect differential gene expression. In 
the real data application, our proposed is able to identify 



Table 3 Selected pathways from the functional analysis 



Term 




Count 


p-value 


Cluster 1 


score: 1 1 .39 






Defense response 


GO:0006952 


106 


5.3E-14 


Response to wonding 


GO:0006954 


90 


1.3E-11 


Inflammatory response 


GO:0009611 


63 


1.0E-10 


Cluster 2 


score: 5.43 






Response to molecule of 
lipopolysaccharide 


GO:0002237 


23 


9.4E-8 


Response to cytokine stimulus 


GO:0034097 


18 


8.0E-5 


Response to bacterium 


GO:0009617 


31 


3.5E-4 


Cluster 3 


score: 5.19 






Regulation of cytokine 
production 


GO:0001817 


41 


2.9E-9 


Positive regulation of cytokine 
production 


GO:0001819 


20 


8.0E-5 


positive regulation of multicel- 
lular organismal process 


GO:0051240 


35 


1.1 E-3 


Cluster 8 


score: 2.89 






Regulation of apoptosis 


GO:0042981 


100 


1 .0E-5 


Regulation of programmed cell 
death 


GO:0043067 


100 


1.5E-5 


Regulation of cell death 


GO:0010941 


100 


1 .8E-5 


Cluster 10 


score: 2.72 






Leukocyte activation 


GO:0045321 


41 


9.2E-6 


Cell activation 


GO:0001775 


46 


1.1 E-5 


Tcell activation 


GO:0046649 


26 


2.0E-5 



Count column indicates the number of DAVID IDs associated with each pathway. 
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shown to be robust from simulation study under empiri- 
cal fold change scenarios. Other possible choices for the 
null genes include a point mass at 0 [50], uniform distribu- 
tion around 0, and a log-Gamma distribution with a mean 
0. Similar distributional assumptions can be made for the 
non-null genes under the two-component mixture set-up. 
Alternatively, one can consider a mixture of three com- 
ponents consisting of equal, over, and under expression 
states. Further extension can be considered by allowing 
variation in the magnitude of expression change across 
individuals. 

Appendix 

Variability across individuals 

The Poisson-Gamma setting (Equation 1 and 2) allows 
extra variance among count expression values [28]. The 
variance of the count is given as 

Var(Y gil ) = E(Var(Y gil \X gi )) + Var(E(Y gil \X gi )) 
= E(N a X gi ) + Var(N a X gi ) 



Therefore, 



Modeling details 

The joint density of z g and Xg is 

p(Xg> Zg) ~ 7ToLogNormal(0, G^)I{z g = 0) 

+ 7TiLogNormal(/xi,a^)/(^ = 1). 

Let 6 be a vector of all model parameters, 0 = 
({oi g }, {fig}, tto, 7Ti, otq , fii, aj 2 ). The complete likelihood of 
the model is 

p(Y,x,z\0) = Ylp(Y g , X g,z g \0) 

g 

= Ylp(Yg\Xg,z g ,0)p( X g\z g ,0)p(z g \0) 



n \Y\pwg> z g>v 



P(Xg\z g ,0)p(z g \6) 



= 11 11 I P(Ygi\k g i,Xg,z g ,6)p(X gi \6)dX gi 
g I i J 

xp(Xg\z g ,6)p(z g \6) 

Here, some details on the integral over X g i follow. 

j p(Y gi \X gi , x g ,z g ,0)p(X gi \0)dX gi 

-I 



(NOX^ (AfaWg p -Nnkgi-NqkgiXg 



ygn'ygti- 



x fk(X gi )dX gi 

_r(%ii +ygi2 + a g ) 



y g n\y gi 2\T(a g ) \p g + N a + N i2 x g 

x ( Nii Y gil ( Ni2Xg ) 

X \P g +Nn+NaXg) \Pg+N a +NaXg) 



g l i i I y&^yg& T (<*g) 

x ( h y 

x ( Nil \ 

X yPg+Nn+Naxg) 



(5) 



nXgj 

Ni2Xg, 

Pg + Nn+Naxg) J 
xp(Xg\z g ,0)p(Zg\0) 



After integrating over the expected gene- and 
individual-specific relative baseline expression (Xgis), the 
posterior density of unknown parameters is proportional 
to the product of likelihood and prior density. 

p(X,z,e\Y)<xp(Y,x,z\9)p(0) 

We use the non-informative prior distributions for 
the unknown model parameters specified in the 
Methods Section. 

Parameter estimates by the Metropolis-Hastings algorithm 
(MCMC) 

We infer the posterior distributions using the Gibbs sam- 
pling [43], which iteratively samples model paramters 
from the conditional distribution of each patermter given 
the other parameters. In this section, we describe the 
procedure for the posterior inference. 

Stepl 

Update otg. The conditional distribution for a g does not 
have a closed form expression. We use the Metropolis- 
Hastings algorithm to sample this parameter. More 
specifically, we update the parameter by proposing 
a new ^ N{a° g ld 1 cr%) at each iteration, where a a is set to 
be 0.1. The proposal is accepted with probability rnin{l, r}, 
where r is the acceptance ratio. 

p(z,x,a" ew ,9°Ji\Y) 



p(z,x,c*f,6°li g \Y) 

where 9°!^ is the current values of the parameters except 
otg and 

p(z,X,<*g,0-a g \Y) cx 1| — 



( h \ ag 

Kfig+Nn+Naxg) 



i2Xg. 

If the proposal is accepted, we replace the old a g with 
the new one. Otherwise, a g stays at the current value. 
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Step2 

Update fig. Similar to sample a g , we propose 
P new ~ N(P 0 id, cfp), where op is set to be 1. The accep- 
tance ratio is calculated as 



Similarly, 0-p g is the vector of parameters except f$ g . 
For the evaluation of the acceptance probability, updated 
value of otg in the Step 1 will be used. 

Step3 

Update (XgtZg) by utilizing generalized Metropolis- 
Hastings. Lewin et al. [38] pointed out that Xg an d z g 
have to be jointly estimated since the supporting space 
of Xg depends on the choice of z g . For example, Xg is a 
point mass at one if z g = 0. To estimate a pair of (Xg>z g ), 
they proposed the state z g first and then updated Xg\%g* 
By utilizing this approach, we adopt the following steps to 
sample (Xg>z g ). 
(Step 3-1) Generate Zg 6W from the Bernoulli distribution, 

with P(z™ w = 0) =7T° ld . 

(Step 3-2) Then, x g ew is proposed from 
LogNorrnal(0, V g ) if Zg 6W = 0. Otherwise, it is sampled 
from LogNormal(M g) V g ). The mean and variance of the 
log-normal proposal distribution are computed from 
the observed counts. First, we collect individuals whose 
pre- and post-treatment counts are non-zero for each 
gene, separately. Then, M g is computed as a median of 

log(^ / A/f ) f° r such individuals. The variance of these 
values can be used as Vg, however, this estimate often 
gives an extreme value. In data analysis, we trim the esti- 
mates at 25th and 75th percentiles when the sample size 
is 10. For small sample case, the median of V g s is used as 
the proposal variance. 



Alternative description 

Define Q(x g ew ,z n g ew \x g ld ,z° g ld ) to be a proposal density 
from the current values (Xg ld > z°g d ) to the proposed values. 
In our approach, the proposal density does not depend 
on the current values, i.e., we use the independence chain 
Metropolis-Hastings. The proposal distribution is given 
by 



Q( \,new „new\^,old „old\ ^—oldj\j(r\ \r \j ( 'new n\ 
[Xg >z g \Xg >z g J ~7T 0 LN(P,Vg)I\Zg =0) 

+ 7tf d LN(M g , V g )I (z n g ew = l) 



The acceptance probabilty is min{l, r} where r is one of 
followings: 



„old 



1, z" 



1 : r : 



z°J d = 1, z" ew = 0 : r = 



„old 



0, z" 



1 : r : 



z°J d = 0, z" ew = 0 : r = 



LN 1 (Xg UW )t(x g KW ) 
LN 1 (xf)t(xf d ) 

LNixf-.Mg, V g ) 
X LN(Xg ew ;M g , V g ) 
LN 0 (x n g ew )t{x n g ew ) 

LN l (xt d )t(x! ld ) 

LN(xf;M g , V g ) 
X LN(x g new ;0,V g ) 
LN x (x n g ew )Kx^ ew ) 
LN 0 ( X f d )t(xf d ) 
LN(xf;0,V g ) 
X LN(Xg* ew ;M g , V g ) 

LN 0 {x™ w )t(x n g eW ) 
LN 0 (x° ld )t(xf d ) 

LN(x° u ;0,V g ) 
X LN( Xg mw ;0,V g ) 



yga 
Jk. 



yia+m^i ' LN 0 is a P rob " 



where t(x e ) = Y\, 

ability density function for log-normal distribution with 
mean zero and variance a^' old . Similarly, LN\ is a log- 
normal density centered at fi" ew and variance a\ 



2,old 



Step4 

Update <Tq , jxi, a\, which are hyper-paramaters from the 
distribution of Xg- Since it has a closed form for the pos- 
terior density conditional on all other parameters, we can 
directly sample those parameters. 



ff0 2 '-~InvGamma(^^, \ £ log( X ,) 2 



z ? =0 



Mi 



" ew ~ Normal 



ttfe = 1) ttfe = 1) , 



a^-InvGamma «^1>, 1 £(l ogx<J - Ml ) 2 



Zn — 1 



where $(z g = 0) = = 0) and %{z g = 1) = 

Z g Kz g = l). 

Step5 

Update the mixing proportions (ttq, 7t\). We assume a 
Dirichilet prior for the mixture probabilities. Using Gibbs 
sampling scheme, these weight parameters are updated 
from Dir{l + JtC% = 0), 1 + B(% = 1)). 
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